Empirical Methods in Information Extraction
Author
Abstract
Most corpus-based methods in natural language processing (NLP) were developed to provide an arbitrary text-understanding application with one or more general-purpose linguistic capabilities. This is evident from the articles in this issue of AI Magazine. Charniak and Ng/Zelle, for example, describe techniques for part-of-speech tagging, parsing, and word-sense disambiguation. These techniques were created with no specific domain or high-level language-processing task in mind. In contrast, this article surveys the use of empirical methods for a particular natural language understanding task that is inherently domain-specific. The task is information extraction. Very generally, an information extraction system takes as input an unrestricted text and “summarizes” the text with respect to a prespecified topic or domain of interest: it finds useful information about the domain and encodes that information in a structured form, suitable for populating databases. In contrast to in-depth natural language understanding tasks, information extraction systems effectively skim a text to find relevant sections and then focus only on these sections in subsequent processing. The information extraction system in Figure 1, for example, summarizes stories about natural disasters, extracting for each such event the type of disaster, the date and time that it occurred, and data on any property damage or human injury caused by the event. Information extraction has figured prominently in the field of empirical NLP: the first large-scale, head-to-head evaluations of NLP systems on the same text-understanding tasks were the DARPA-sponsored MUC performance evaluations of information extraction systems (Lehnert and Sundheim, 1991; Chinchor et al., 1993). Prior to each evaluation, all participating sites receive a corpus of texts from a predefined domain and the corresponding “answer keys” to use for system development.
The answer keys are manually encoded templates, much like that of Figure 1, that capture all information from the corresponding source text that is relevant to the domain, as specified in a set of written guidelines. After a short development phase, the NLP systems are evaluated by comparing the summaries each produces with the summaries generated by human experts for the same test set of previously unseen texts. The comparison is performed using an automated scoring program that rates each system according to measures of recall and precision. Recall measures the amount of the relevant information that the NLP system correctly extracts from the test collection, while precision measures the reliability of the information extracted.
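The two measures described above are commonly computed as ratios: recall is the number of correct extractions divided by the number of items in the answer key, and precision is the number of correct extractions divided by the number of items the system extracted. The following is a minimal sketch of that computation, not the MUC scoring program itself. The `SlotFill` type, the slot names, the sample values, and the exact-match comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotFill:
    """One extracted fact: a value placed in a template slot for a document.

    Hypothetical representation; real answer-key templates are richer.
    """
    doc_id: str
    slot: str
    value: str


def recall_precision(extracted, answer_key):
    """Return (recall, precision) for extracted slot fills against a key.

    Assumes exact-match scoring: a fill counts as correct only if the
    document, slot, and value all match an entry in the answer key.
    """
    extracted = set(extracted)
    answer_key = set(answer_key)
    correct = len(extracted & answer_key)
    # Recall: share of the answer key that the system found.
    recall = correct / len(answer_key) if answer_key else 0.0
    # Precision: share of the system's output that is correct.
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Made-up answer key for one natural-disaster story (illustrative values).
answer_key = {
    SlotFill("doc1", "disaster_type", "tornado"),
    SlotFill("doc1", "date", "1997-04-03"),
    SlotFill("doc1", "property_damage", "12 homes destroyed"),
    SlotFill("doc1", "human_injury", "3 injured"),
}

# Made-up system output: two correct fills and one wrong value.
system_output = {
    SlotFill("doc1", "disaster_type", "tornado"),
    SlotFill("doc1", "date", "1997-04-03"),
    SlotFill("doc1", "human_injury", "30 injured"),
}

recall, precision = recall_precision(system_output, answer_key)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this made-up case the system finds two of the four key items and is right on two of its three outputs, so a system can score higher on precision than on recall by extracting conservatively.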
Similar resources
A review on EEG based brain computer interface systems feature extraction methods
The brain-computer interface (BCI) provides a communication channel between human and machine. Most of these systems are based on brain activities. Brain-computer interfacing is a methodology that provides a way to communicate with the outside environment using brain signals. The success of this methodology depends on the selection of methods to process the brain signals in each pha...
Detection of perturbed quantization (PQ) steganography based on empirical matrix
The perturbed quantization (PQ) steganography scheme is almost undetectable with current steganalysis methods. We present a new steganalysis method for detection of this data hiding algorithm. We show that the PQ method distorts the dependencies of DCT coefficient values; especially changes much lower than significant bit planes. For steganalysis of PQ, we propose features extraction from the e...
Presenting an Empirical Correlation for Maximum Sauter Mean Diameter in a Spray Extraction Column
Based on the importance of drop behavior in liquid-liquid extraction, the maximum Sauter mean drop diameter has been investigated and correlated in a counter-current spray extraction column with two chemical systems. Spargers were set of nozzles in all experiments. Studying the effects of several parameters on drop size, some correlations were estimated by the last available version of softw...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
An overview of empirical natural language processing.(Natural Language
In recent years, there has been a resurgence in research on empirical methods in natural language processing. These methods employ learning techniques to automatically extract linguistic knowledge from natural language corpora rather than require the system developer to manually encode the requisite knowledge. The current special issue reviews recent research in empirical methods in speech reco...
Journal title: AI Magazine
Volume: 18
Publication date: 1997